Objectives

We aim to develop a method to align and compare topics when

  1. the number of topics is changed (varying K)

  2. hyper-parameters of LDA are changed (e.g. varying \(\alpha\))

  3. different modalities [? is there a better term than “modality” ?] exist for the same documents. For example, the same set of documents may exist in different languages without a direct translation of each word. Or, more commonly in biology, the same samples have been analyzed for different -omics data, e.g. metagenomics or transcriptomics, and we want to compare the topics obtained from these different domains.

Notations

Models

We aim to compare the topics of \(M\) LDA models. Each specific model is denoted by \(m \in [1:M]\).

Topics

Each model \(m\) has \(K\) topics. Each topic is denoted by \(k \in [1:K]\). When the model needs to be made explicit, we write \(k^m\) for a topic of model \(m\).

Documents (Samples)

The dataset is composed of \(D\) documents (or samples). Each document/sample is denoted by \(d \in [1:D]\).

Words (features)

The dataset contains counts for a set of \(W\) words (or features). In biology, these features would be genes, transcripts, proteins, bacterial species, etc. Each word is denoted by an index \(w \in [1:W]\). The number of occurrences of word \(w\) in a specific document \(d\) is denoted by \(c_{w,d}\).

LDA model matrices

LDA models are defined by two matrices:

  • \(\gamma\), the document-topic matrix: \(\gamma_{d,k}\) is the proportion of topic \(k\) in document \(d\)

  • \(\beta\), the topic-word matrix: \(\beta_{w,k}\) is the probability of word \(w\) in topic \(k\)

In other words, an LDA model finds topics such that each document is optimally described as a mixture of topics (\(\gamma\)), each topic being itself characterized by a distribution of word probabilities (\(\beta\)).
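
To make the roles of the two matrices concrete, here is a minimal base-R sketch on made-up values (3 documents, 2 topics, 4 words). The object names `gamma`, `beta` and `expected_freq` are illustrative and are not taken from the functions used later in this document.

```r
# Toy LDA output: D = 3 documents, K = 2 topics, W = 4 words (made-up values).
gamma <- matrix(c(0.9, 0.1,
                  0.5, 0.5,
                  0.2, 0.8),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("d", 1:3), paste0("k", 1:2)))
beta <- matrix(c(0.7, 0.0,
                 0.2, 0.1,
                 0.1, 0.3,
                 0.0, 0.6),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("w", 1:4), paste0("k", 1:2)))

# Each document is a mixture of topics: the rows of gamma sum to 1.
stopifnot(all(abs(rowSums(gamma) - 1) < 1e-12))
# Each topic is a distribution over words: the columns of beta sum to 1.
stopifnot(all(abs(colSums(beta) - 1) < 1e-12))

# Word frequencies implied by the model for each document (D x W matrix):
# a gamma-weighted mixture of the topic-specific word distributions.
expected_freq <- gamma %*% t(beta)
```

Each row of `expected_freq` is again a probability distribution over words, which is what "a mixture of topics" means in practice.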

Aligning topics

For objectives (1) and (2), we can align topics using the \(\beta\) matrices from each model \(m\), while for objective (3), only the \(\gamma\) matrices can be used to align topics, since the modalities share the same documents but not the same words.

We will thus first consider the problem of aligning topics using the \(\gamma\) matrices, then consider the “inverse” problem of aligning topics using the \(\beta\) matrices and discuss similarities and differences.

In both cases, in addition to aligning topics between successive models (e.g. successive values of \(K\) or \(\alpha\), or manually ordered modalities), we are also interested in computing and visualizing the alignment between each model and a reference model \(m^R\).

Aligning topics based on the \(\gamma\) matrices

First, each document \(d\) is assigned a reference topic \(k^R_d\), defined as the topic of the reference model \(m^R\) with the largest proportion in this document: \(k^R_d = \arg \max_{k^R} \gamma_{d,k^R}\).
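
As an illustration, the reference-topic assignment can be sketched in base R as follows. The matrix `gamma_ref` holds made-up values, and breaking ties by taking the first maximum is an assumption of this sketch, not something specified above.

```r
# Toy gamma matrix of the reference model: 4 documents x 3 topics (made-up values).
gamma_ref <- matrix(c(0.7, 0.2, 0.1,
                      0.1, 0.1, 0.8,
                      0.3, 0.6, 0.1,
                      0.5, 0.4, 0.1),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(paste0("d", 1:4), c("a", "b", "c")))

# Reference topic of each document: the column with the largest proportion.
# ties.method = "first" makes the choice deterministic when there are ties.
k_ref <- colnames(gamma_ref)[max.col(gamma_ref, ties.method = "first")]
names(k_ref) <- rownames(gamma_ref)
```

Here `k_ref` is a character vector with one reference topic label per document.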

We then compute the proportion of mass transferred between each topic of successive models as \(w^{\gamma}_{k^m, k^{m+1}} = \frac{1}{D} \sum_{d}^D \gamma_{d,k^m} \ \gamma_{d, k^{m+1}}\)
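
With the \(\gamma\) matrices of two successive models stored as document-by-topic matrices, these weights reduce to a single cross-product. The sketch below uses made-up values; `gamma_m` and `gamma_next` are illustrative names.

```r
# Toy gamma matrices for two successive models fitted on the same 3 documents.
gamma_m <- matrix(c(1.0, 0.0,
                    0.5, 0.5,
                    0.0, 1.0),
                  nrow = 3, byrow = TRUE)        # model m, K = 2
gamma_next <- matrix(c(0.8, 0.2, 0.0,
                       0.2, 0.6, 0.2,
                       0.0, 0.1, 0.9),
                     nrow = 3, byrow = TRUE)     # model m + 1, K = 3

D <- nrow(gamma_m)

# w[i, j] = (1 / D) * sum over d of gamma_m[d, i] * gamma_next[d, j]
w <- crossprod(gamma_m, gamma_next) / D
```

Because every row of a \(\gamma\) matrix sums to one, the entries of `w` sum to one over all pairs of topics, so each weight can be read as a share of the total document mass.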

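And if we desire to split these weights by reference topics, we have \(w^{\gamma}_{k^m, k^{m+1}, k^R} = \frac{1}{D} \sum_{d \in D^R} \gamma_{d,k^m} \ \gamma_{d, k^{m+1}}\), where \(D^R\) is the set of documents whose reference topic is \(k^R\).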
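
A possible way to compute the split weights is sketched below. It assumes that the sum is restricted to the documents assigned to a given reference topic while the division is still by the total number of documents \(D\), so that the split weights add up to the unsplit ones. All values and the labels in `k_ref` are made up.

```r
# Toy gamma matrices for two successive models fitted on the same 4 documents.
gamma_m <- matrix(c(0.9, 0.1,
                    0.6, 0.4,
                    0.3, 0.7,
                    0.1, 0.9),
                  nrow = 4, byrow = TRUE)
gamma_next <- matrix(c(0.7, 0.2, 0.1,
                       0.5, 0.3, 0.2,
                       0.1, 0.6, 0.3,
                       0.0, 0.2, 0.8),
                     nrow = 4, byrow = TRUE)

# Hypothetical reference topic of each document (from the reference model).
k_ref <- c("a", "a", "b", "c")

D <- nrow(gamma_m)

# One weight matrix per reference topic: sum only over the documents assigned
# to that reference topic, but still divide by the total number of documents.
w_split <- lapply(split(seq_len(D), k_ref), function(idx) {
  crossprod(gamma_m[idx, , drop = FALSE], gamma_next[idx, , drop = FALSE]) / D
})

# Summing the split weights over reference topics gives the unsplit weights.
w_total <- Reduce(`+`, w_split)
```

`w_split` is a named list with one matrix per reference topic, which maps naturally to coloring each flow by reference topic.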

Consequently, the “height” of each topic \(h_{k^m}\) is \(h_{k^m} = \sum_d \gamma_{d,k^m}\). Topics that are the main topics of many documents have a larger “height” than topics that are secondary topics of many documents or the main topic of few documents.
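
The link between heights and weights can be checked numerically: since each row of a \(\gamma\) matrix sums to one, summing the weights leaving topic \(k^m\) over all topics of the next model gives \(h_{k^m} / D\). The sketch below uses made-up values and illustrative names.

```r
# Toy gamma matrices: topic "a" is the main topic of 3 of the 4 documents.
gamma_m <- matrix(c(0.9, 0.1,
                    0.8, 0.2,
                    0.7, 0.3,
                    0.2, 0.8),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(NULL, c("a", "b")))
gamma_next <- matrix(c(0.6, 0.3, 0.1,
                       0.5, 0.4, 0.1,
                       0.4, 0.4, 0.2,
                       0.1, 0.1, 0.8),
                     nrow = 4, byrow = TRUE)

D <- nrow(gamma_m)

# Height of each topic: total proportion it receives across all documents.
h <- colSums(gamma_m)

# Weights between successive models, then total outgoing weight of each topic.
w <- crossprod(gamma_m, gamma_next) / D
h_from_weights <- D * rowSums(w)
```

In this toy example the heights add up to the number of documents, and topic "a" is taller than topic "b" because it dominates more documents.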

Aligning topics based on the \(\beta\) matrices

To align topics based on the distribution of word probability in these topics, we first define the following concepts:

  • the average word frequency: \(f_w = \frac{1}{D} \sum_d^D f_{w,d}\) with \(f_{w,d} = \frac{c_{w,d}}{\sum_{w'}^W c_{w',d}}\)

  • the “topic height”: \(h_{k^m} = \sum_w^W f_w \ \beta_{w,k^m}\)

  • the “reference topic height” in each topic: \(h_{k^m, k^R} = \sum_w^{W^R} f_w \ \beta_{w,k^m}\)

[NOTE: these definitions are sufficient to draw the composition of each topic for each \(m\), but drawing the flow between the topics requires finding the optimal mass transfer. I kept writing down my notes below, but they are not very useful and are not implemented; instead, I implemented something a little ugly for the visualization of the flows.]

  • the “word height” in each topic: \(h_{w, k^m} = \beta_{w,k^m} \ h_{k^m}\)

  • the modeled “word height” over all topics: XXXX
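
The quantities defined above that are fully specified (average word frequency, topic height and word height) can be sketched in base R as follows. The count matrix and \(\beta\) values are made up, and the object names are illustrative. The reference topic height is left out because it depends on how \(W^R\) is defined.

```r
# Toy word counts c[w, d]: W = 3 words (rows) x D = 2 documents (columns).
counts <- matrix(c(2, 0,
                   1, 1,
                   1, 3),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("w", 1:3), paste0("d", 1:2)))

# Toy beta matrix of one model: W words x K = 2 topics, columns sum to 1.
beta <- matrix(c(0.8, 0.0,
                 0.2, 0.2,
                 0.0, 0.8),
               nrow = 3, byrow = TRUE,
               dimnames = list(paste0("w", 1:3), c("a", "b")))

# f[w, d]: frequency of word w within document d (each column sums to 1).
f_wd <- sweep(counts, 2, colSums(counts), "/")

# f[w]: average frequency of word w over documents.
f_w <- rowMeans(f_wd)

# Topic height: h[k] = sum over w of f[w] * beta[w, k].
h <- drop(crossprod(beta, f_w))

# Word height in each topic: h[w, k] = beta[w, k] * h[k].
word_height <- sweep(beta, 2, h, "*")
```

Each column of `word_height` sums to the corresponding topic height, since each column of `beta` sums to one.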

Implementation

We have implemented the methods described above in a series of functions which can be run sequentially:

  • run_lda_models() fits one LDA model per requested value of \(K\) and returns the \(\beta\) and \(\gamma\) matrices of all models in long format

  • align_topics() computes the alignment between the topics of successive models and with the reference model, and orders the topics for display

  • visualize_aligned_topics() draws the aligned topics and the flows between them

Below is an example of how these functions are used on vaginal microbiome data.

Example 1: varying \(K\)

# Libraries to attach
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
library(topicmodels)
library(slam)

# load the topic alignment functions
source("align_topic_functions.R")

# viz default theme
theme_set(theme_minimal())
load(file = "vm_16s_data.Rdata", verbose = TRUE)
## Loading objects:
##   vm_16s
new_asv_names =  colnames(vm_16s) %>% 
  str_split_fixed(., " ", n = 8) %>%
  as.matrix() %>% .[,c(6, 7, 8)]  %>%
  as.data.frame() %>% 
  set_colnames(c("genus","species","strain")) %>% 
   mutate(short_name = 
            str_c(genus, " ", 
                  species %>% str_replace(.,"NA","-")," ",
                  strain)) %>% 
  select(short_name) %>% unlist()

j = which(duplicated(new_asv_names))
# number the duplicated names so that all column names are unique
new_asv_names[j] = str_c(new_asv_names[j], " (", seq_along(j), ")")
colnames(vm_16s) = new_asv_names
  
vm_16s <- slam::as.simple_triplet_matrix(vm_16s %>%  round())
topic_models_dir = "lda_models/"

lda_models = 
  run_lda_models(
    data = vm_16s,
    Ks = 1:13,
    method = "VEM",
    seed = 2,
    dir = topic_models_dir
  )

names(lda_models)
## [1] "betas"  "gammas"
head(lda_models$betas)
## # A tibble: 6 x 5
##   m         K k_LDA w                              b
##   <fct> <dbl> <chr> <chr>                      <dbl>
## 1 1         1 a     Lactobacillus iners 1     0.297 
## 2 1         1 a     Lactobacillus crispatus 1 0.239 
## 3 1         1 a     Lactobacillus iners 2     0.0353
## 4 1         1 a     Lactobacillus gasseri 1   0.0448
## 5 1         1 a     Megasphaera - 1           0.0346
## 6 1         1 a     Lactobacillus jensenii 1  0.0392
head(lda_models$gammas)
## # A tibble: 6 x 5
##   m         K k_LDA d              g
##   <fct> <dbl> <chr> <chr>      <dbl>
## 1 1         1 a     1005601068     1
## 2 1         1 a     1005601078     1
## 3 1         1 a     1005601088     1
## 4 1         1 a     1005601098     1
## 5 1         1 a     1005601108     1
## 6 1         1 a     1005601118     1
aligned_topics = 
  align_topics(
    data = vm_16s, 
    lda_models = lda_models
  )

names(aligned_topics)
## [1] "lda_models"      "gamma_alignment" "topics_order"
head(aligned_topics$gamma_alignment)
## # A tibble: 6 x 10
##   m     m_next m_ref k_LDA k_LDA_next k_LDA_ref       w     k k_next k_ref
##   <fct> <fct>  <fct> <chr> <chr>      <chr>       <dbl> <int>  <int> <int>
## 1 1     2      13    a     a          a         0.00639     1      1     3
## 2 1     2      13    a     a          b         0.0104      1      1    13
## 3 1     2      13    a     a          c         0.00110     1      1    10
## 4 1     2      13    a     a          d         0.0141      1      1    12
## 5 1     2      13    a     a          e         0.00467     1      1     4
## 6 1     2      13    a     a          f         0.0666      1      1     5
# head(aligned_topics$beta_alignment) # not implemented
ggplot(aligned_topics$topics_order, aes(x = m, y = k, col = k_LDA)) + 
  geom_text(aes(label = k_LDA)) + guides(col = "none")

g_aligned_topics = 
  visualize_aligned_topics(
    aligned_topics = aligned_topics,
    add_leaves = TRUE,
    min_beta = 0.05,
    add_words_labels = TRUE
    )

g_aligned_topics

g_aligned_topics = 
  visualize_aligned_topics(
    aligned_topics = aligned_topics,
    add_leaves = FALSE
    )

g_aligned_topics

g_aligned_topics_ref = 
  visualize_aligned_topics(
    aligned_topics = aligned_topics,
    color_by = "reference",
    add_leaves = FALSE
    )

g_aligned_topics_ref

g_aligned_topics_ref = 
  visualize_aligned_topics(
    aligned_topics = aligned_topics,
    color_by = "reference",
    add_leaves = TRUE
    )

g_aligned_topics_ref

Example 2: varying \(\alpha\)

Example 3: aligning topics across modalities